Abstract



An arc-annotated string is a string of characters, called bases, augmented with a set of pairs, 
called arcs, each connecting two bases. Given arc-annotated strings P and Q the arc-preserving 
subsequence problem is to determine if P can be obtained from Q by deleting bases from Q. 
Whenever a base is deleted any arc with an endpoint in that base is also deleted. Arc-annotated 
strings where the arcs are "nested" are a natural model of RNA molecules that captures both the 
primary and secondary structure of these. The arc-preserving subsequence problem for nested 
arc-annotated strings is a basic primitive for investigating the function of RNA molecules. Gramm
et al. [ACM Trans. Algorithms 2006] gave an algorithm for this problem using $O(nm)$ time
and space, where $m$ and $n$ are the lengths of $P$ and $Q$, respectively. In this paper we present
a new algorithm using $O(nm)$ time and $O(n + m)$ space, thereby matching the previous time
bound while reducing the space from quadratic to linear. This is essential for
processing large RNA molecules, where space is likely to be a bottleneck. To obtain our result
we introduce several novel ideas which may be of independent interest for related problems on
arc-annotated strings.

1 Introduction 

An arc-annotated string $S$ is a string augmented with an arc set $A_S$. Each character in $S$ is called a
base and the arc set $A_S$ is a set of pairs of positions in $S$ connecting two distinct bases. We say that
$S$ is a nested arc-annotated string if no two arcs in $A_S$ share an endpoint and no two arcs cross each
other, i.e., for all $(i_l, i_r), (i'_l, i'_r) \in A_S$ we have that $i_l < i'_l < i_r$ iff $i_l < i'_r < i_r$. Given arc-annotated
strings $P$ and $Q$ we say that $P$ is an arc-preserving subsequence (APS) of $Q$, denoted $P \sqsubseteq Q$, if $P$
can be obtained from $Q$ by deleting zero or more bases from $Q$. Whenever a base is deleted any arc
with an endpoint in that base is also deleted. The arc-preserving subsequence problem (APS) is to
determine if $P \sqsubseteq Q$. If $P$ and $Q$ are both nested arc-annotated strings we refer to the problem
as the nested arc-preserving subsequence problem (NAPS). Fig. 1(a) shows an example of nested
arc-annotated strings.
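
As a small illustration of the nesting condition, the following Python sketch checks whether an arc set is nested. The representation (arcs as pairs of 1-based positions with the left endpoint first) and the name `is_nested` are assumptions made for this example, not notation from the paper.

```python
def is_nested(arcs):
    """Return True if no two arcs share an endpoint and no two arcs cross.

    Arcs are assumed to be pairs (left, right) of 1-based positions.
    """
    arcs = sorted(arcs)
    endpoints = [p for arc in arcs for p in arc]
    if len(endpoints) != len(set(endpoints)):
        return False  # a shared endpoint, or an arc joining a base to itself
    open_right = []  # right endpoints of the arcs enclosing the current position
    for left, right in arcs:
        if left > right:
            return False  # malformed pair
        while open_right and open_right[-1] < left:
            open_right.pop()  # that arc closed before this one starts
        if open_right and open_right[-1] < right:
            return False  # starts inside an open arc but ends outside it
        open_right.append(right)
    return True
```

Because the arcs are scanned by increasing left endpoint, the stack always holds the arcs that enclose the current arc, so a crossing shows up as a right endpoint that exceeds the innermost enclosing one.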

Ribonucleic acid (RNA) molecules are often modeled as nested arc-annotated strings. Here, 
the string consists of bases from the 4-letter alphabet {A, U, C,G}, called the primary structure, 
and an arc set consisting of pairings between bases, called the secondary structure. The secondary 
Figure 1: (a) Nested arc-annotated strings $P$ and $Q$. Here, $P$ and $Q$ contain arcs connecting their
first and last bases. (b) The corresponding trees $T_P$ and $T_Q$ induced by the arcs.



structure of RNA is central for many biological functions and it is more preserved in evolution than 
the primary structure [2ll8lll2 1 [T5 l ll6 y 2i] . NAPS is a simple primitive for comparing the secondary 
structure of the RNA molecules. Furthermore, it may also serve as a subroutine in algorithms
for the more general longest arc-preserving common subsequence problem (LAPCS), see Gramm et
al. [13].

Building on earlier work in a related model of RNA molecules by Vialette [22], Gramm et al. [13]
introduced NAPS and gave an algorithm for it using $O(nm)$ time and space, where $m$ and $n$ are the
lengths of $P$ and $Q$, respectively. Kida [17] presented an experimental study of this algorithm and
Damaschke [10] considered a special restricted case of the problem.

1.1 Results 

We assume a standard unit-cost RAM model with word size $\Theta(\log n)$ and a standard instruction
set including arithmetic operations, bitwise boolean operations, and shifts. The space complexity
is the number of words used by the algorithm. All of the previous results are in the same model of
computation. Throughout the paper $P$ and $Q$ are nested arc-annotated strings of lengths $m$ and
$n$, respectively. In this paper we present a new algorithm with the following complexities.

Theorem 1 Given nested arc-annotated strings $P$ and $Q$ of lengths $m$ and $n$, respectively, we can
solve the nested arc-preserving subsequence problem in time $O(nm)$ and space $O(n + m)$.

Hence, we match the running time of the fastest known algorithm and at the same time we
improve the space from $O(nm)$ to $O(n + m)$. This space improvement is critical for processing large
RNA molecules. In particular, an algorithm using $O(nm)$ space quickly becomes infeasible, even for
moderate sizes of RNA molecules, due to costly accesses to external memory. An algorithm using
$O(m + n)$ space is much more scalable and allows us to handle significantly larger RNA molecules.
Furthermore, we note that obtaining an algorithm using $O(nm)$ time and $o(nm)$ space is mentioned
as an open problem in Gramm et al. [13].

Compared to the previous work by Gramm et al. [13] our algorithm is not only more
space-efficient but also simpler. Our algorithm is based on a single unified dynamic programming
recurrence, whereas the algorithm by Gramm et al. requires computing and tabulating auxiliary
information in multiple phases mixed with dynamic programming. Our approach allows us to
better expose the features of NAPS and is essential for obtaining a linear space algorithm.
1.2 Techniques

As mentioned above, our algorithm is based on a new dynamic programming recurrence. Essentially,
the recursion expresses for any pair of substrings $P'$ and $Q'$ of $P$ and $Q$, respectively, the longest
prefix of $P'$ which is an arc-preserving subsequence of $Q'$ in terms of smaller substrings of $P'$ and
$Q'$. We combine several new ideas with well-known techniques to convert our recurrence into an
efficient algorithm.

First, we organize the dynamic programming recurrence into $\Gamma$ sequences. A $\Gamma$ sequence for
a given substring $Q'$ of $Q$ is a simple $O(m)$ space representation of the longest prefix of each
suffix of $P$ that is an arc-preserving subsequence of $Q'$. We show how to efficiently manipulate $\Gamma$ sequences to get
new $\Gamma$ sequences using a small set of simple operations, called the primitive operations. Secondly,
we organize the computation of $\Gamma$ sequences using a recursive algorithm that traverses the tree
structure of the arcs in $Q$. The algorithm computes the $\Gamma$ sequence for each arc in $Q$ using the
primitive operations. To avoid storing too many $\Gamma$ sequences during the traversal we direct the
computation according to the well-known heavy-path decomposition of the tree. This leads to an
algorithm that stores at most $O(\log |A_Q|)$ $\Gamma$ sequences. Since each $\Gamma$ sequence uses $O(m)$ space the
total space becomes $O(m \log |A_Q| + n)$.

Finally, to achieve linear space we exploit a structural property of $\Gamma$ sequences to compress them
efficiently. We obtain a new representation of $\Gamma$ sequences that only requires $O(m)$ bits. Plugging
the new representation into our algorithm the total space becomes $O(n + m)$ as desired. However,
the resulting algorithm requires many costly compressions and decompressions of $\Gamma$ sequences at
each arc in the traversal. As a practical and more elegant solution we show how to augment the
compressed representation of $\Gamma$ sequences using standard rank/select indices to obtain constant
time random access to elements in $\Gamma$ sequences. This allows us to compress each $\Gamma$ sequence only
once and avoid decompression entirely without affecting the complexity of the algorithm.

1.3 Related Work 

Arc-annotated strings are a natural model of RNA molecules that captures both the primary and 
secondary structure of these. Consequently, a wide range of pattern matching problems for them 
have been studied, see e.g., (TJ [3J [4l [Tl [TlJ [19] . Among these, NAPS is one of the most basic 
problems. 

The NAPS problem generalizes the tree inclusion problem for ordered trees [5|[9|ll8j. Here, the 
goal is to determine if a tree can be obtained from another tree by deleting nodes. This is equivalent 
to NAPS where all bases in both strings have an incident arc. The authors have shown how to solve
the tree inclusion problem in time $O(nm/\log n + n \log n)$ and space $O(n + m)$ [5]. Compared to our
current result for NAPS the space complexity is the same but the time complexity for tree inclusion
is a factor $O(\log n)$ better for most values of $m$ and $n$. Though our obtained complexities for the
tree inclusion problem and NAPS are very similar, the ideas and techniques behind the results 
differ significantly. While the definition of the two problems seems very similar it appears that 
the more general NAPS is significantly more complicated. We leave it as an interesting research 
direction to determine the precise relationship between NAPS and the tree inclusion problem. 

Several generalizations of NAPS have also been studied relaxing the requirement that arcs 
should be nested [611114 [13]. In nearly all cases the resulting problem becomes NP-complete. 
Figure 2: An embedding of $P$ in $Q$. $f(1) = 1$, $f(2) = 2$ and $f(j) = j + 2$ for $j = 3, 4, \ldots, 9$. $A_P =
\{(1,9), (3,5), (7,8)\}$ and $A_Q = \{(1,11), (3,8), (5,7), (9,10)\}$. We have $(f(1), f(9)) = (1,11) \in A_Q$,
$(f(3), f(5)) = (5,7) \in A_Q$ and $(f(7), f(8)) = (9,10) \in A_Q$.

1.4 Outline 

In Sec. 2 we give some preliminaries and define our notation. Sec. 3 contains our dynamic
programming recurrence. In Sec. 4 we present our main algorithm achieving $O(m \log |A_Q| + n)$ space.
Finally, in Sec. 5 we show how to compress the $\Gamma$ sequences stored by our algorithm to obtain
$O(n + m)$ space.

2 Preliminaries and Notation 

Let $S$ be an arc-annotated string with arc set $A_S$. The length of $S$ is the number of bases in $S$
and is denoted $|S|$. We will assume that our input strings $P$ and $Q$ have the arcs $(1, |P|)$ and
$(1, |Q|)$, respectively. If this is not the case we may always add additional connected bases to the
start and end of $P$ and $Q$ without affecting the solution or complexity of the problem. We do this
only to ensure that the nesting of the arcs forms a tree (rather than a forest), which simplifies the
presentation of our algorithm.

The arc-annotated substring $S[i_1, i_2]$, $1 \le i_1, i_2 \le |S|$, is the string of bases starting at $i_1$ and
ending at $i_2$. The arc set associated with $S[i_1, i_2]$ is the subset of $A_S$ of arcs with both endpoints
in $[i_1, i_2]$. We define $S[i_1] = S[i_1, i_1]$ and $S[i_1, i_2] = \epsilon$ (the empty string) if $i_1 > i_2$. Note the arc
set of an arc-annotated string of length $\le 1$ is also empty. A split of $S$ is a partition of $S$ into two
substrings $S[1, i]$ and $S[i + 1, |S|]$ for some $i$, $0 \le i \le |S|$. The split is an arc-preserving split if no
arcs in $A_S$ cross $i$, i.e., all arcs have both endpoints either in $S[1, i]$ or in $S[i + 1, |S|]$. We say that
the index $i$ induces an (arc-preserving) split of $S$.
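
The split condition can be sketched in a few lines of Python. The function name and the convention of passing the substring bounds `lo` and `hi` explicitly are assumptions of this example; arcs are pairs of 1-based positions.

```python
def induces_arc_preserving_split(arcs, lo, hi, i):
    """True if index i splits S[lo, hi] into S[lo, i] and S[i+1, hi]
    without cutting any arc that has both endpoints in [lo, hi]."""
    assert lo - 1 <= i <= hi, "i must lie between lo - 1 and hi"
    return not any(lo <= left <= i < right <= hi for left, right in arcs)
```

For instance, with the arc set {(1,9), (3,5), (7,8)}, index 5 splits the substring S[2, 8] without cutting an arc, whereas index 4 cuts the arc (3,5). Arcs with an endpoint outside the substring, such as (1,9) here, are ignored because they are not part of the substring's arc set.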

An embedding of $P$ in $Q$ is an injective function $f : \{1, \ldots, m\} \to \{1, \ldots, n\}$ such that

1. for all $j \in \{1, \ldots, m\}$, $P[j] = Q[f(j)]$. (base match condition)

2. for all indices $j_l, j_r \in \{1, \ldots, m\}$, $(j_l, j_r) \in A_P \Leftrightarrow (f(j_l), f(j_r)) \in A_Q$. (arc match condition)

3. for all $i, j \in \{1, \ldots, m\}$, $i < j \Leftrightarrow f(i) < f(j)$. (order condition)

If $f(j) = i$ we say that $j$ is matched to $i$ in the embedding. From the definition of arc-preserving
subsequences we have that $P \sqsubseteq Q$ iff there is an embedding of $P$ in $Q$. Figure 2 gives an example
of an embedding.
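
The three conditions translate directly into a checker. The sketch below is only illustrative: the function name, the use of a dictionary for $f$, and 1-based positions stored over 0-based Python strings are all assumptions of this example.

```python
def is_embedding(P, AP, Q, AQ, f):
    """Check the base match, arc match, and order conditions for a map f.

    P, Q are strings; AP, AQ are sets of arcs (pairs of 1-based positions);
    f is a dict from positions 1..len(P) to positions 1..len(Q).
    """
    m, n = len(P), len(Q)
    js = range(1, m + 1)
    if any(not 1 <= f[j] <= n for j in js):
        return False
    # Order condition (a strictly increasing map is also injective).
    if any(f[j] >= f[j + 1] for j in range(1, m)):
        return False
    # Base match condition.
    if any(P[j - 1] != Q[f[j] - 1] for j in js):
        return False
    # Arc match condition, checked in both directions at once.
    return all(((jl, jr) in AP) == ((f[jl], f[jr]) in AQ)
               for jl in js for jr in js if jl < jr)
```

Since $f$ is order preserving, every arc of $A_Q$ between two matched positions is the image of a pair $(j_l, j_r)$ with $j_l < j_r$, so the single equality test covers both directions of the equivalence.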
Figure 3: Illustration of the Splitting Lemma. Index $i = 8$ induces an arc-preserving split of $Q'$
with $Q_1 = Q'[1, 8]$ and $Q_2 = Q'[9, 11]$. Index $j = 6$ induces an arc-preserving split of $P'$ with
$P_1 = P'[1, 6]$ and $P_2 = P'[7, 9]$ such that $P_1 \sqsubseteq Q_1$ and $P_2 \sqsubseteq Q_2$.

3 The Dynamic Programming Recurrence 

In this section we give our dynamic programming recurrence for NAPS. Essentially, the recursion 
expresses for any pair of substrings P' and Q' of P and Q, respectively, the longest prefix of P' 
which is an arc-preserving subsequence of Q' in terms of smaller substrings of P' and Q' . 
We show the following key properties of arc-preserving splits. 

Lemma 1 (Splitting Lemma) Let $P'$ and $Q'$ be arc-annotated substrings of $P$ and $Q$, respectively,
and let $(Q_1, Q_2)$ be any arc-preserving split of $Q'$.

(i) If $P' \sqsubseteq Q'$ then there exists an arc-preserving split $(P_1, P_2)$ of $P'$ such that $P_1 \sqsubseteq Q_1$ and
$P_2 \sqsubseteq Q_2$.

(ii) Let $(P_1, P_2)$ be an arc-preserving split of $P'$. Then $P_1 \sqsubseteq Q_1$ and $P_2 \sqsubseteq Q_2 \Rightarrow P' \sqsubseteq Q'$.

Proof. Let $P' = P[j_1, j_2]$, $Q' = Q[i_1, i_2]$, $Q_1 = Q[i_1, i]$, and $Q_2 = Q[i + 1, i_2]$.

To prove (i), let $f$ be an embedding of $P'$ in $Q'$ (such an embedding exists since $P' \sqsubseteq Q'$).
Let $j$ be the largest index such that $f(j) \le i$, i.e., $f(j)$ is a base in $Q_1$. It follows that
$f_1 : \{j_1, \ldots, j\} \to \{i_1, \ldots, i\}$, where $f_1(x) = f(x)$, is an embedding of $P[j_1, j]$ in $Q_1$, and thus
$P[j_1, j] \sqsubseteq Q_1$. Similarly, $f_2 : \{j + 1, \ldots, j_2\} \to \{i + 1, \ldots, i_2\}$, where $f_2(x) = f(x)$, is an embedding
of $P[j + 1, j_2]$ in $Q_2$, and therefore $P[j + 1, j_2] \sqsubseteq Q_2$. We have now shown that there exists a split
$(P_1, P_2)$, with $P_1 = P[j_1, j]$ and $P_2 = P[j + 1, j_2]$, such that $P_1 \sqsubseteq Q_1$ and $P_2 \sqsubseteq Q_2$. It remains to
show that this is an arc-preserving split, i.e., that there are no arcs from $P[j_1, j]$ to $P[j + 1, j_2]$ in
$A_{P'}$. For contradiction assume that there exists an arc $(j_l, j_r)$ with $j_l \in [j_1, j]$ and $j_r \in [j + 1, j_2]$.
By the definition of $f_1$ and $f_2$ we have $f(j_l) = f_1(j_l) \in [i_1, i]$ and $f(j_r) = f_2(j_r) \in [i + 1, i_2]$.
Since $f$ is an embedding of $P'$ in $Q'$ it follows from the arc match condition that there is an arc
$(f(j_l), f(j_r)) \in A_{Q'}$. But this contradicts the fact that $(Q_1, Q_2)$ is an arc-preserving split of $Q'$.
Thus there can be no such arc and it follows that $j$ induces an arc-preserving split of $P'$.

To prove (ii), assume $P_1 \sqsubseteq Q_1$ and $P_2 \sqsubseteq Q_2$. Let $f_1$ be the embedding of $P_1$ in $Q[i_1, i]$, let $f_2$
be the embedding of $P_2$ in $Q[i + 1, i_2]$, and let $j$ be the index such that $P_1 = P[j_1, j]$. We will show
that the function $f : \{j_1, \ldots, j_2\} \to \{i_1, \ldots, i_2\}$,

\[
f(x) = \begin{cases} f_1(x) & x \in [j_1, j], \\ f_2(x) & x \in [j + 1, j_2], \end{cases}
\]

is an embedding of $P'$ in $Q'$.
We have that $f$ satisfies the base match condition and the order condition. It remains to
show that it satisfies the arc match condition. Since $j$ induces an arc-preserving split of $P'$ we
have $A_{P'} = A_{P_1} \cup A_{P_2}$. Let $(j_l, j_r)$ be an arc in $A_{P'}$. If $(j_l, j_r) \in A_{P_1}$, then $(f(j_l), f(j_r)) =
(f_1(j_l), f_1(j_r)) \in A_{Q_1}$, since $f_1$ is an embedding of $P_1$ in $Q_1$. Similarly, if $(j_l, j_r) \in A_{P_2}$, then
$(f(j_l), f(j_r)) = (f_2(j_l), f_2(j_r)) \in A_{Q_2}$. Thus $(j_l, j_r) \in A_{P'} \Rightarrow (f(j_l), f(j_r)) \in A_{Q'}$. By the same
kind of argument it follows that since $i$ induces an arc-preserving split of $Q'$, $(f(j_l), f(j_r)) \in A_{Q'} \Rightarrow
(j_l, j_r) \in A_{P'}$. □

We now define $\gamma$, which we will use to give our dynamic programming recurrence for NAPS.

Definition 1 For $1 \le j_l \le m$, $l \in \{1, 2\}$, and $1 \le i_1 \le i_2 \le n$, define $\gamma(j_1, j_2, i_1, i_2)$ to be the
largest integer $k$ such that $P[j_1, k] \sqsubseteq Q[i_1, i_2]$ and $k$ induces an arc-preserving split of $P[j_1, j_2]$.

It follows that $\gamma(1, m, 1, n) = m$ if and only if $P \sqsubseteq Q$.
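
To make the definition concrete, the following brute-force sketch evaluates $\gamma$ straight from Definition 1 by enumerating all order-preserving maps. It takes exponential time and is meant only as a reference for tiny inputs; it is not the algorithm of this paper, and all function names are hypothetical.

```python
from itertools import combinations

def sub_arcs(arcs, lo, hi):
    """Arcs with both endpoints inside [lo, hi]."""
    return {(l, r) for (l, r) in arcs if lo <= l and r <= hi}

def is_aps(P, AP, Q, AQ, j1, k, i1, i2):
    """Is P[j1, k] an arc-preserving subsequence of Q[i1, i2]? (1-based)"""
    ap, aq = sub_arcs(AP, j1, k), sub_arcs(AQ, i1, i2)
    js = list(range(j1, k + 1))
    # combinations() yields increasing tuples, so the order condition holds.
    for image in combinations(range(i1, i2 + 1), len(js)):
        f = dict(zip(js, image))
        if (all(P[j - 1] == Q[f[j] - 1] for j in js)
                and all(((a, b) in ap) == ((f[a], f[b]) in aq)
                        for a, b in combinations(js, 2))):
            return True
    return False

def gamma_brute(P, AP, Q, AQ, j1, j2, i1, i2):
    """Largest k such that P[j1, k] embeds in Q[i1, i2] and k induces
    an arc-preserving split of P[j1, j2]."""
    ap = sub_arcs(AP, j1, j2)
    for k in range(j2, j1 - 2, -1):
        crosses = any(l <= k < r for (l, r) in ap)
        if not crosses and is_aps(P, AP, Q, AQ, j1, k, i1, i2):
            return k
    return j1 - 1
```

The split requirement matters: for $P$ = "AU" with the arc (1,2) and $Q$ = "AU" without arcs, the single base "A" does embed in $Q$, but $k = 1$ would cut the arc of $P$, so the sketch returns 0.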

The Splitting Lemma gives us a very useful property of $\gamma$: The requirement that $k$ induces an
arc-preserving split of $P[j_1, j_2]$ in the definition of $\gamma$ implies that if there exists an embedding $f$ of
$P[k + 1, j_2]$ in $Q[i_2 + 1, i]$ for some $i$ then by the Splitting Lemma the embedding of $P[j_1, k]$ in $Q[i_1, i_2]$
(which exists by the definition of $\gamma$) can be extended with $f$ to get an embedding of $P[j_1, j_2]$ in
$Q[i_1, i]$. This would not be true if we dropped the requirement that $k$ induces an arc-preserving
split of $P[j_1, j_2]$. Formally,

Corollary 1 Let $i$ be an index inducing an arc-preserving split of $Q[i_1, i_2]$. Then,

\[
\gamma(j_1, j_2, i_1, i_2) = \gamma(\gamma(j_1, j_2, i_1, i) + 1, j_2, i + 1, i_2).
\]

Proof. Let $k = \gamma(j_1, j_2, i_1, i_2)$, $j = \gamma(j_1, j_2, i_1, i)$, and $k' = \gamma(j + 1, j_2, i + 1, i_2)$. We want to show
that $k' = k$. We will first show that $k' \le k$ by using the second part of the Splitting Lemma. Next
we show that $k' \ge k$ by using the first part of the Splitting Lemma.

Let $Q_1 = Q[i_1, i]$ and $Q_2 = Q[i + 1, i_2]$. By the definition of $\gamma$, $j$ and $k'$ we have

$P[j_1, j] \sqsubseteq Q_1$ and $P[j + 1, k'] \sqsubseteq Q_2$.

Since $P[j_1, k] \sqsubseteq Q[i_1, i_2]$ and $j + 1 \ge j_1$ we have,

$P[j + 1, k] \sqsubseteq Q[i_1, i_2]$.

By the definition of $\gamma$, index $j$ induces an arc-preserving split of $P[j_1, j_2]$. Since $k' \le j_2$ index $j$ also
induces an arc-preserving split of $P[j_1, k']$. By the Splitting Lemma (ii) we have $P[j_1, k'] \sqsubseteq Q[i_1, i_2]$.
Since $k'$ induces an arc-preserving split of $P[j + 1, j_2]$ and $j$ induces an arc-preserving split of $P[j_1, j_2]$
we have that $k'$ induces an arc-preserving split of $P[j_1, j_2]$. This, together with $P[j_1, k'] \sqsubseteq Q[i_1, i_2]$,
implies $k' \le k$, since by definition of $\gamma$, $k$ is the largest index inducing an arc-preserving split of
$P[j_1, j_2]$ such that $P[j_1, k] \sqsubseteq Q[i_1, i_2]$.

To show that $k \le k'$ we use the first part of the Splitting Lemma. Since $i$ induces an
arc-preserving split of $Q[i_1, i_2]$, by the Splitting Lemma (i) there exists a $j'$ such that $P[j_1, j'] \sqsubseteq Q_1$
and $P[j' + 1, k] \sqsubseteq Q_2$ and $j'$ induces an arc-preserving split of $P[j_1, j_2]$. Let $j^*$ be the largest such
$j'$. We will show that $j^* = \gamma(j_1, j_2, i_1, i) = j$. This implies $P[j + 1, k] = P[j^* + 1, k] \sqsubseteq Q_2$ and thus
$k \le k'$ since $k'$ is the largest integer such that $P[j + 1, k'] \sqsubseteq Q_2$.

By definition $j$ is the largest integer inducing an arc-preserving split of $P[j_1, j_2]$ such that
$P[j_1, j] \sqsubseteq Q_1$ and thus $j^* \le j$. But this implies $P[j + 1, k] \sqsubseteq P[j^* + 1, k] \sqsubseteq Q_2$. Thus $j = j^*$. □

Intuitively, the corollary says that to compute the largest prefix of $P$ that can be embedded
in $Q$ we can greedily match the bases and right endpoints of arcs of $P$ as far to the left in $Q$
as possible. The dynamic programming recurrence for $\gamma$ is as follows. The intuition behind the
recurrence is that it corresponds to computing the leftmost embedding of $P[j_1, j_2]$ in $Q[i_1, i_2]$.

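Base cases. $\gamma(j_1, j_2, i_1, i_2)$ is equal to

\[
\left\{
\begin{array}{lll}
j_1 - 1 & \text{if } j_1 > j_2, & (1) \\
j_1 & \text{if } i_1 = i_2 \text{ and } P[j_1] = Q[i_1] \text{ and } (j_1, j_r) \notin A_P \text{ for all } j_r \le j_2, & (2a) \\
j_1 - 1 & \text{if } i_1 = i_2 \text{ and } (P[j_1] \ne Q[i_1] \text{ or } (j_1, j_r) \in A_P \text{ for some } j_r \le j_2). & (2b)
\end{array}
\right.
\]

Recursive cases. $i_1 < i_2$ and $j_1 \le j_2$.

If $(i_1, i_r) \notin A_Q$ for all $i_r \le i_2$ then $\gamma(j_1, j_2, i_1, i_2)$ is equal to

\[
\left\{
\begin{array}{lll}
\gamma(j_1 + 1, j_2, i_1 + 1, i_2) & \text{if } (j_1, j_r) \notin A_P \text{ for all } j_r \le j_2 \text{ and } P[j_1] = Q[i_1], & (3) \\
\gamma(j_1, j_2, i_1 + 1, i_2) & \text{if } (j_1, j_r) \in A_P \text{ for some } j_r \le j_2 \text{ or } P[j_1] \ne Q[i_1]. & (4)
\end{array}
\right.
\]

If $(i_1, i_r) \in A_Q$ for some $i_r < i_2$, then $\gamma(j_1, j_2, i_1, i_2)$ is equal to

\[
\gamma(\gamma(j_1, j_2, i_1, i_r) + 1, j_2, i_r + 1, i_2). \qquad (5)
\]

If $(i_1, i_2) \in A_Q$ then $\gamma(j_1, j_2, i_1, i_2)$ is equal to

\[
\left\{
\begin{array}{lll}
\max\{\gamma(j_1, j_2, i_1 + 1, i_2), \gamma(j_1, j_2, i_1, i_2 - 1)\} & \text{if } (j_1, j_r) \notin A_P \text{ for all } j_r \le j_2, & (6) \\
\gamma(j_1, j_2, i_1 + 1, i_2) & \text{if } (j_1, j_r) \in A_P \text{ for some } j_r \le j_2, \text{ and } P[j_1] \ne Q[i_1] \text{ or } P[j_r] \ne Q[i_2], & (7) \\
\max\{\phi, \gamma(j_1, j_2, i_1 + 1, i_2)\} & \text{if } (j_1, j_r) \in A_P \text{ for some } j_r \le j_2, \ P[j_1] = Q[i_1] \text{ and } P[j_r] = Q[i_2], & (8)
\end{array}
\right.
\]

where

\[
\phi = \begin{cases} j_r & \text{if } \gamma(j_1 + 1, j_r - 1, i_1 + 1, i_2 - 1) = j_r - 1, \\ j_1 - 1 & \text{otherwise.} \end{cases}
\]

The cases are visualized in Fig. 4.

The base cases (1)-(2) cover the cases where $P[j_1, j_2]$ is the empty string ($j_1 > j_2$) or $Q[i_1, i_2]$
is a single base ($i_1 = i_2$). Let $k = \gamma(j_1, j_2, i_1, i_2)$. Cases (3) and (5) follow directly from Corollary 1.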
Figure 4: The main cases from the recurrence relation. Case (3): Neither $P$ nor $Q$ starts with an
arc. Case (4): $P$ starts with an arc, $Q$ does not. Case (5): $Q$ starts with an arc not spanning $Q$.
We split $Q$ after the arc and compute $\gamma$ first in the first half and then continue the computation in
the other. Case (6): $Q$ starts with an arc, $P$ does not. Case (7)-(8): Both $P$ and $Q$ start with an
arc.



In cases (4) and (7) the base $Q[i_1]$ cannot be part of an embedding of $P[j_1, k]$ in $Q[i_1, i_2]$ and thus
$\gamma(j_1, j_2, i_1, i_2) = \gamma(j_1, j_2, i_1 + 1, i_2)$. In case (6) either $Q[i_1]$ or $Q[i_2]$, but not both, can be part of an
embedding of $P[j_1, k]$ in $Q[i_1, i_2]$. Thus, $\gamma(j_1, j_2, i_1, i_2) = \max\{\gamma(j_1, j_2, i_1, i_2 - 1), \gamma(j_1, j_2, i_1 + 1, i_2)\}$.
Case (8) is the most complicated one. Both $Q[i_1, i_2]$ and $P[j_1, j_2]$ start with an arc and the bases of
the arcs match. An embedding of $P[j_1, k]$ into $Q[i_1, i_2]$ either (i) matches the two arcs, (ii) matches
the arc $(j_1, j_r)$ and the rest of $P[j_1, k]$ in $Q[i_1 + 1, i_2]$ or (iii) matches nothing ($k = j_1 - 1$). In case
(ii) $\gamma(j_1, j_2, i_1, i_2) = \gamma(j_1, j_2, i_1 + 1, i_2)$. Case (i) requires that $P[j_1 + 1, j_r - 1] \sqsubseteq Q[i_1 + 1, i_2 - 1]$.
We express this in the recurrence by using an auxiliary function $\phi$ which is $j_r$ if $\gamma(j_1 + 1, j_r - 1, i_1 +
1, i_2 - 1) = j_r - 1$ and $j_1 - 1$ otherwise, since in the last case the arc $(j_1, j_r)$ cannot be matched to
the arc $(i_1, i_2)$. Since we want the largest match we take the maximum of the two cases (i) and (ii)
(case (iii) is covered by these two).
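
The recurrence can be evaluated directly with memoization. The sketch below is a plain top-down evaluation over all four indices, so it does not achieve the time or space bounds of the algorithm in Section 4; it is only meant to make cases (1)-(8) executable. The names are hypothetical, positions are 1-based, and the guard for an empty substring of $Q$ ($i_1 > i_2$) is an added assumption, since the base cases above do not list it explicitly.

```python
from functools import lru_cache

def make_gamma(P, AP, Q, AQ):
    """Return a memoized function gamma(j1, j2, i1, i2) following cases (1)-(8)."""
    p_right = {l: r for (l, r) in AP}  # left endpoint -> right endpoint in P
    q_right = {l: r for (l, r) in AQ}  # left endpoint -> right endpoint in Q

    @lru_cache(maxsize=None)
    def gamma(j1, j2, i1, i2):
        # Case (1); the test i1 > i2 is an assumed guard for an empty Q range.
        if j1 > j2 or i1 > i2:
            return j1 - 1
        jr = p_right.get(j1)
        p_arc = jr is not None and jr <= j2   # P[j1, j2] starts with an arc
        base_ok = P[j1 - 1] == Q[i1 - 1]
        if i1 == i2:                          # cases (2a) and (2b)
            return j1 if (base_ok and not p_arc) else j1 - 1
        ir = q_right.get(i1)
        if ir is None or ir > i2:             # Q[i1] starts no arc in Q[i1, i2]
            if base_ok and not p_arc:
                return gamma(j1 + 1, j2, i1 + 1, i2)               # case (3)
            return gamma(j1, j2, i1 + 1, i2)                       # case (4)
        if ir < i2:                                                # case (5)
            return gamma(gamma(j1, j2, i1, ir) + 1, j2, ir + 1, i2)
        skip = gamma(j1, j2, i1 + 1, i2)      # from here on (i1, i2) is an arc
        if not p_arc:
            return max(skip, gamma(j1, j2, i1, i2 - 1))            # case (6)
        if not base_ok or P[jr - 1] != Q[i2 - 1]:
            return skip                                            # case (7)
        inner = gamma(j1 + 1, jr - 1, i1 + 1, i2 - 1)
        phi = jr if inner == jr - 1 else j1 - 1
        return max(phi, skip)                                      # case (8)

    return gamma
```

With this sketch, $P \sqsubseteq Q$ can be tested as `make_gamma(P, AP, Q, AQ)(1, len(P), 1, len(Q)) == len(P)`.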

Relation to the Gramm et al. algorithm. The recurrence relation is not very different from
the one implicit in the algorithm by Gramm et al. [13]. Most of the individual cases are the same.
The main difference is that they intermix the description of the algorithm and the recurrence.
Moreover, where we require that $\gamma(j_1, j_2, i_1, i_2)$ induces an arc-preserving split of $P[j_1, j_2]$,
they instead specify a specific order in which to calculate the recurrence and save some auxiliary
information during the computation. Thus our definition of $\gamma$ gives us the possibility to state the
recurrence relation independently of the algorithm.

4 The Algorithm 

We now present an algorithm to solve NAPS in $O(nm)$ time and $O(m \log |A_Q| + n)$ space. In the
next section we show how to further reduce the space to $O(n + m)$ to get Theorem 1. The result
relies on a well-known path decomposition for trees applied to arc-annotated strings combined with
a new idea to organize the dynamic programming recurrence computation. We present these in
Sections 4.1 and 4.2 before giving the algorithm in Section 4.3.

4.1 Heavy-Path Decomposition of Arc-Annotated Sequences

Let $S$ be a nested arc-annotated string containing the arc $(1, |S|)$ (recall that we assume that both
$P$ and $Q$ have this arc). The arcs in $A_S$ induce a rooted and ordered tree $T_S$ rooted at the arc
$(1, |S|)$ as shown in Fig. 1(b). We use standard tree terminology for the relationship between arcs
in $T_S$. Let $(i_l, i_r)$ be an arc in $A_S$. The depth of $(i_l, i_r)$ is the number of edges on the path from
$(i_l, i_r)$ to the root in $T_S$. An arc with no children is a leaf arc and otherwise an internal arc.
Define $T_S(i_l, i_r)$ to be the subtree of $T_S$ rooted at $(i_l, i_r)$ and let size$(i_l, i_r)$ be the number of arcs in
$T_S(i_l, i_r)$. Note that size$(1, |S|) = |A_S|$. If $(i'_l, i'_r)$ is an arc in $T_S(i_l, i_r)$, then $(i_l, i_r)$ is an ancestor
of $(i'_l, i'_r)$ (note that $(i_l, i_r)$ is an ancestor of itself). If $(i_l, i_r)$ is an ancestor of $(i'_l, i'_r)$, then $(i'_l, i'_r)$ is
a descendant of $(i_l, i_r)$.

As in [13] we partition $T_S$ into disjoint paths. We classify each arc as either heavy or light. The
root is light. For each internal arc $(i_l, i_r)$ we pick a child $(i'_l, i'_r)$ of maximum size and classify it
as heavy. The remaining children are light. An edge to a light child is a light edge and an edge to
a heavy child is a heavy edge. Let lightdepth$(i_l, i_r)$ denote the number of light edges on the path
from $(i_l, i_r)$ to the root of $T_S$. If $(i'_l, i'_r)$ is a light child of $(i_l, i_r)$, then size$(i'_l, i'_r) \le$ size$(i_l, i_r)/2$ since
otherwise $(i'_l, i'_r)$ would be heavy. Consequently, the number of light edges on a path from the root
to a leaf is at most logarithmic. Specifically, we will use the following well-known bound for trees
restated for nested arc-annotated sequences.

Lemma 2 (Harel and Tarjan [14]) Let $S$ be a nested arc-annotated string containing the arc
$(1, |S|)$. For any arc $(i_l, i_r) \in A_S$, lightdepth$(i_l, i_r) \le \log |A_S| + O(1)$.

Removing the light edges we partition $T_S$ into heavy paths.
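
The decomposition can be sketched as follows. This is a minimal illustration under the stated assumption that the arc set is nested and contains the spanning root arc; the function name and the returned dictionaries are hypothetical, and ties for the heavy child are broken in favor of the leftmost child of maximum size.

```python
def heavy_path_decomposition(arcs):
    """Build the arc tree and classify children as heavy or light.

    arcs: nested arcs as (left, right) pairs, including the root (1, |S|).
    Returns (children, size, heavy, lightdepth), all keyed by arc.
    """
    order = sorted(arcs)  # by left endpoint: ancestors precede descendants
    children = {a: [] for a in order}
    stack = []  # arcs enclosing the current position, outermost first
    for arc in order:
        while stack and stack[-1][1] < arc[0]:
            stack.pop()  # that arc ended before this one starts
        if stack:
            children[stack[-1]].append(arc)  # children stay in left-to-right order
        stack.append(arc)
    size = {}
    for arc in reversed(order):  # descendants are processed before ancestors
        size[arc] = 1 + sum(size[c] for c in children[arc])
    heavy = {a: max(children[a], key=size.get) for a in order if children[a]}
    lightdepth = {order[0]: 0}  # the root is light and has no edges above it
    for arc in order:
        for c in children[arc]:
            lightdepth[c] = lightdepth[arc] + (0 if heavy[arc] == c else 1)
    return children, size, heavy, lightdepth
```

For the arc set {(1,11), (3,8), (5,7), (9,10)}, the root has children (3,8) and (9,10); (3,8) has size 2 and becomes the heavy child, so only (9,10) has light depth 1.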

4.2 Manipulating $\Gamma$ Sequences

For positions $i_1$ and $i_2$ in $Q$, $i_1 \le i_2$, define the $\Gamma$ sequence for $i_1$ and $i_2$ as

\[
\Gamma(i_1, i_2) = \gamma(m, m, i_1, i_2), \gamma(m - 1, m, i_1, i_2), \ldots, \gamma(1, m, i_1, i_2).
\]

Thus, $\Gamma(i_1, i_2)$ is the sequence of endpoints of the longest prefixes of each suffix of $P$ that is an
arc-preserving subsequence of $Q[i_1, i_2]$. We can efficiently manipulate $\Gamma$ sequences as suggested by
the following lemma.

Lemma 3 For any positions $i_1$ and $i_2$ in $Q$, $i_1 < i_2$, we can compute in $O(m)$ time

(i) $\Gamma(i_2, i_2)$.

(ii) $\Gamma(i_1, i_2)$ from $\Gamma(i_1 + 1, i_2)$ if $(i_1, i_r) \notin A_Q$ for every $i_r \le i_2$.

(iii) $\Gamma(i_1, i_2)$ from $\Gamma(i_1, i_r)$ and $\Gamma(i_r + 1, i_2)$ if $(i_1, i_r) \in A_Q$ for some $i_r < i_2$.

(iv) $\Gamma(i_1, i_2)$ from $\Gamma(i_1, i_2 - 1)$, $\Gamma(i_1 + 1, i_2)$, and $\Gamma(i_1 + 1, i_2 - 1)$ if $(i_1, i_2) \in A_Q$.
Figure 5: The extend, combine, and meld operations, respectively. For each operation the substring
range(s) below the string indicate the endpoints of the input $\Gamma$ sequence(s) needed in the operation
to compute the $\Gamma$ sequence for the entire string.

Proof. All the cases follow directly from the dynamic programming recurrence. Case (i) follows 
from case (2) of the recurrence, Case (ii) from case (3) and (4) of the recurrence, Case (iii) from 
case (5) of the recurrence and Case (iv) from case (6)-(8) of the recurrence. □ 

We will use each of the four cases in Lemma 3 as primitive operations in our algorithm and we refer to (i),
(ii), (iii), and (iv) as an initialize, an extend, a combine, and a meld operation, respectively. Fig. 5
illustrates the extend, combine, and meld operations. An extend operation from $\Gamma(i_1 + k, i_2)$ to
$\Gamma(i_1, i_2)$, for some $k > 1$, is defined to be the sequence of $k$ extend operations needed to compute
$\Gamma(i_1, i_2)$ from $\Gamma(i_1 + k, i_2)$.
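
Three of the four primitive operations can be sketched directly from cases (2)-(5) of the recurrence. The representation below is an assumption of this example: a $\Gamma$ sequence is a Python list `g` with `g[j]` holding $\gamma(j, m, \cdot, \cdot)$ for $1 \le j \le m$, plus a sentinel `g[m + 1] = m` for the empty suffix, and `p_left` is the set of left endpoints of arcs in $P$. The meld operation is omitted because it needs additional case analysis.

```python
def initialize(P, p_left, Q, i):
    """Gamma(i, i), from base cases (2a)/(2b) with j2 = m."""
    m = len(P)
    g = [None] * (m + 2)
    g[m + 1] = m  # sentinel: gamma(m + 1, m, ., .) = m by base case (1)
    for j in range(1, m + 1):
        matches = P[j - 1] == Q[i - 1] and j not in p_left
        g[j] = j if matches else j - 1
    return g

def extend(P, p_left, Q, i1, g):
    """Gamma(i1, i2) from g = Gamma(i1 + 1, i2), cases (3)/(4).

    Only valid when Q[i1] starts no arc ending inside the range.
    """
    new = g[:]
    for j in range(1, len(P) + 1):
        if P[j - 1] == Q[i1 - 1] and j not in p_left:
            new[j] = g[j + 1]  # case (3): match P[j] to Q[i1]
    return new              # unchanged entries are case (4)

def combine(g_left, g_right):
    """Gamma(i1, i2) from Gamma(i1, i) and Gamma(i + 1, i2), case (5)."""
    m = len(g_left) - 2
    return [None] + [g_right[g_left[j] + 1] for j in range(1, m + 1)] + [m]
```

Each operation touches every entry once, which matches the $O(m)$ bound of Lemma 3. By Corollary 1, `combine` is valid for any index inducing an arc-preserving split of the range of $Q$, not only for the right endpoint of an arc starting at $i_1$.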

4.3 The Algorithm 

We now present our main algorithm. Initially, we construct $T_Q$ with a heavy path decomposition
in $O(n)$ time and space. Then, we recursively compute $\Gamma$ sequences for each arc $(i_l, i_r) \in A_Q$ in a
top-down traversal of $T_Q$. The $\Gamma$ sequence for the root contains the value $\gamma(1, m, 1, n)$ and hence
this suffices to solve NAPS. The main idea is as follows. At each internal arc we first recursively
compute the $\Gamma$ sequence for the heavy child and then compute $\Gamma$ sequences for the remaining
light children in right-to-left order (we will later see that this processing order is essential for
achieving the space bound). At the same time we use the extend and combine operations to
compute $\Gamma$ sequences with a right endpoint at $i_r$ or $i_r - 1$ and a left endpoint at positions between
children in the same right-to-left order. Finally, we use the meld operation on $\Gamma(i_l + 1, i_r)$,
$\Gamma(i_l, i_r - 1)$, and $\Gamma(i_l + 1, i_r - 1)$ to get $\Gamma(i_l, i_r)$.

At an arc $(i_l, i_r) \in A_Q$ in the traversal there are two cases to consider:

Case 1: $(i_l, i_r)$ is a leaf arc. We compute $\Gamma(i_l, i_r)$ as follows.

1. Initialize $\Gamma(i_r, i_r)$ and $\Gamma(i_r - 1, i_r - 1)$.

2. Extend $\Gamma(i_r, i_r)$ and $\Gamma(i_r - 1, i_r - 1)$ to get $\Gamma(i_l + 1, i_r)$, $\Gamma(i_l, i_r - 1)$, and $\Gamma(i_l + 1, i_r - 1)$.

3. Meld $\Gamma(i_l + 1, i_r)$, $\Gamma(i_l, i_r - 1)$, and $\Gamma(i_l + 1, i_r - 1)$ to get $\Gamma(i_l, i_r)$.

Case 2: $(i_l, i_r)$ is an internal arc. Let $(i^1_l, i^1_r), \ldots, (i^s_l, i^s_r)$ be the children arcs of $(i_l, i_r)$ in
left-to-right order. To simplify the algorithm we set $i^0_r = i_l$. We compute $\Gamma(i_l, i_r)$ as follows.

1. Recursively compute $R_h := \Gamma(i^h_l, i^h_r)$, where $(i^h_l, i^h_r)$ is the heavy child arc of $(i_l, i_r)$.

2. Initialize $\Gamma(i_r, i_r)$ and $\Gamma(i_r - 1, i_r - 1)$.

3. Extend $\Gamma(i_r, i_r)$ and $\Gamma(i_r - 1, i_r - 1)$ to get $\Gamma(i^s_r + 1, i_r)$ and $\Gamma(i^s_r + 1, i_r - 1)$.
Figure 6: Snapshot of the $\Gamma$ sequences computed at an internal arc. The ranges below the
arc-annotated sequences represent $\Gamma$ sequence endpoints. (a) After the recursive call to the heavy child
in line 1. (b) After the extend operations in line 3. (c) After the recursive call in line 4(a). (d) After
the combine operations in line 4(b). (e) Before the meld operation in line 6. (f) After the meld
operation.

4. For $k := s$ down to 1 do:

(a) If $k \ne h$ recursively compute $R_k := \Gamma(i^k_l, i^k_r)$.

(b) Combine $R_k$ with $\Gamma(i^k_r + 1, i_r)$ and with $\Gamma(i^k_r + 1, i_r - 1)$ to get $\Gamma(i^k_l, i_r)$ and $\Gamma(i^k_l, i_r - 1)$.

(c) Extend $\Gamma(i^k_l, i_r)$ and $\Gamma(i^k_l, i_r - 1)$ to get $\Gamma(i^{k-1}_r + 1, i_r)$ and $\Gamma(i^{k-1}_r + 1, i_r - 1)$.

5. Extend $\Gamma(i_l + 1, i_r - 1)$ to get $\Gamma(i_l, i_r - 1)$.

6. Meld $\Gamma(i_l + 1, i_r)$, $\Gamma(i_l, i_r - 1)$, and $\Gamma(i_l + 1, i_r - 1)$ to get $\Gamma(i_l, i_r)$.

The computation in case 2 is illustrated in Fig. 6. Note that when $k = 1$ in the loop in line 4, line
4(c) computes $\Gamma(i^0_r + 1, i_r) = \Gamma(i_l + 1, i_r)$ and $\Gamma(i^0_r + 1, i_r - 1) = \Gamma(i_l + 1, i_r - 1)$. In both cases
above the algorithm computes several local $\Gamma$ sequences of the form $\Gamma(i, i_r)$ and $\Gamma(i, i_r - 1)$, for
some $i \le i_r$. These sequences are computed in order of decreasing values of $i$ and each sequence
only depends on the previous one and recursively computed $\Gamma$ sequences. Hence, we only need to
store a constant number of local sequences during the computation at $(i_l, i_r)$.

4.4 Analysis 

We first consider the time complexity of the algorithm. To do so we bound the total number
of primitive operations. For each arc in $A_Q$ there is a constant number of initialize and meld
operations, and for each child arc there is a constant number of combine operations at its parent.
Hence, the total number of initialize, meld, and
combine operations is $O(|A_Q|)$. To count the number of extend operations we first define for any
arc $(i_l, i_r) \in A_Q$ the set spaces$(i_l, i_r)$ as the set of positions inside $(i_l, i_r)$ but not inside any child
arc of $(i_l, i_r)$, that is,

spaces$(i_l, i_r) = \{i \mid i_l \le i \le i_r$ but not $i'_l \le i \le i'_r$ for any child $(i'_l, i'_r)$ of $(i_l, i_r)\}$.

For example, spaces$(1, 11)$ for $Q$ in Fig. 1(a) is $\{1, 2, 11\}$. The spaces sets for all arcs form a partition of
the positions in $Q$ and thus $\sum_{(i_l, i_r) \in A_Q} |$spaces$(i_l, i_r)| = n$. At an arc $(i_l, i_r)$ the algorithm performs
$O(|$spaces$(i_l, i_r)|)$ extend operations and hence the total number of extend operations is $O(n)$. By
Lemma 3 each primitive operation takes $O(m)$ time and therefore the total running time of the
algorithm is $O(|A_Q| m + nm) = O(nm)$.
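
The spaces sets and the partition property used in the counting argument can be checked with a short sketch. It assumes a nested arc set containing the root arc $(1, n)$; the function name is hypothetical, and the quadratic running time is chosen for brevity rather than efficiency.

```python
def spaces(arcs):
    """Map each arc to the positions it covers that lie in none of its children.

    Assumes the arcs are nested, so the innermost arc covering a position
    is the shortest arc that contains it.
    """
    owner = {}
    for left, right in sorted(arcs, key=lambda a: a[1] - a[0]):  # shortest first
        for i in range(left, right + 1):
            owner.setdefault(i, (left, right))  # keep the innermost arc only
    result = {arc: set() for arc in arcs}
    for i, arc in owner.items():
        result[arc].add(i)
    return result
```

For the arc set {(1,11), (3,8), (5,7), (9,10)} this yields {1, 2, 11} for the root, and the set sizes sum to $n = 11$, in line with the partition argument above.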

For the space complexity we bound the number of $\Gamma$ sequences stored by the algorithm. When
the algorithm visits an arc $(i_l, i_r)$ we are currently processing a nested sequence of recursive calls
corresponding to a path $p$ in $T_Q$ from the root to $(i_l, i_r)$. The total number of $\Gamma$ sequences stored
is the sum of the numbers of $\Gamma$ sequences stored at each of these recursive calls. Consider an edge $e$ in $p$ from a
parent $(i'_l, i'_r)$ to a child $(i''_l, i''_r)$. If $e$ is heavy the recursive call to $(i''_l, i''_r)$ is done in line 1 of case 2
in the algorithm immediately at the start of the visit to $(i'_l, i'_r)$. Therefore, no $\Gamma$ sequence at $(i'_l, i'_r)$
is stored. If $e$ is light the recursive call to $(i''_l, i''_r)$ is done in line 4(a). The algorithm stores at most
3 $\Gamma$ sequences, namely $\Gamma(i''_r + 1, i'_r)$, $\Gamma(i''_r + 1, i'_r - 1)$, and $\Gamma(i^h_l, i^h_r)$, where $(i^h_l, i^h_r)$ is the heavy
child of $(i'_l, i'_r)$. By Lemma 2 there are at most $\log |A_Q| + O(1)$ light ancestors of $(i_l, i_r)$ in $T_Q$ and
therefore the total space for stored $\Gamma$ sequences is $O(m \log |A_Q|)$. The additional space used by the
algorithm is $O(n)$. Hence, we have the following result.

Lemma 4 Given nested arc-annotated strings $P$ and $Q$ of lengths $m$ and $n$, respectively, we can
solve the nested arc-preserving subsequence problem in time $O(nm)$ and space $O(m \log |A_Q| + n)$.

5 Squeezing into Linear Space 

We now show how to compress $\Gamma$ sequences into a compact representation using $O(m)$ bits.
Plugging the new representation into our algorithm the total space becomes $O(n + m)$ as desired for
Theorem 1.

Our compression scheme for $\Gamma$ sequences relies on the following key property of the values of $\gamma$.



Lemma 5 For any integers $j_1, j_2, i_1, i_2$, $1 \le j_1 \le j_2 \le m$, $1 \le i_1 \le i_2 \le n$,

\[
j_1 - 1 \le \gamma(j_1, j_2, i_1, i_2) \le \gamma(j_1 + 1, j_2, i_1, i_2) \le m.
\]

Proof. Adding another base in front of the substring $P[j_1 + 1, j_2]$ cannot increase the endpoint of
an embedding of $P[j_1 + 1, j_2]$ in $Q$ and therefore $\gamma(j_1, j_2, i_1, i_2) \le \gamma(j_1 + 1, j_2, i_1, i_2)$. Furthermore,
for any substring $P[j_1, j_2]$ we can embed at most $j_2 - j_1 + 1$ bases and at least 0 bases in $Q$, implying
the remaining inequalities. □

Let $i_1, i_2$ be indices in $Q$ such that $i_1 \le i_2$ and consider the sequence

\[
\Gamma(i_1, i_2) = \gamma(m, m, i_1, i_2), \ldots, \gamma(1, m, i_1, i_2) = \gamma_m, \ldots, \gamma_1.
\]

By Lemma 5 we have that $\gamma_m, \ldots, \gamma_1$ is a non-increasing and non-negative sequence where $\gamma_m$
is either $m$ or $m - 1$. We encode the sequence efficiently using two bit strings $V$ and $U$ defined
as follows. The string $V$ is formed by the concatenation of $m$ bit strings $s_m, \ldots, s_1$, that is,
$V = s_m \cdot s_{m-1} \cdots s_1$, where $\cdot$ denotes concatenation. The string $s_m$ is the single bit $s_m = m - \gamma_m$
and $s_k$, $1 \le k < m$, is given by

\[
s_k = \begin{cases} 0 & \text{if } \gamma_{k+1} - \gamma_k = 0, \\ \underbrace{1 \cdots 1}_{\gamma_{k+1} - \gamma_k \text{ times}} & \text{if } \gamma_{k+1} - \gamma_k > 0. \end{cases}
\]

Let $D_k$ denote the sum of bits in string $s_m \cdots s_k$. We have that $m - D_m = m - s_m = \gamma_m$
and inductively $m - D_k = \gamma_k$. The string $U$ is the bit string of length $|V|$ consisting of a 1 in each
position where a substring in $V$ ends. Given $V$ and $U$ we can therefore uniquely recover $\gamma_m, \ldots, \gamma_1$.
Since $\gamma_m, \ldots, \gamma_1$ can decrease by at most $m + 1$ the total number of 1s in $V$ is at most $m + 1$.
The total number of 0s is at most $m$ and therefore $|V| \le 2m + 1$. Hence, our representation uses
$O(m)$ bits. We can compress $\gamma_m, \ldots, \gamma_1$ into $V$ and $U$ in a single scan in $O(m)$ time. Reversing
the process we can also decompress in $O(m)$ time. Hence, we have the following result.

Lemma 6 We can represent any $\Gamma$ sequence using $O(m)$ bits. Compression and decompression take
$O(m)$ time.
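
The encoding into $V$ and $U$ can be sketched as below. For readability the bit strings are stored as Python lists of 0/1 integers rather than packed machine words, so the sketch illustrates the encoding itself and not the $O(m)$-bit space bound. The function names are hypothetical, and the input list is assumed to be ordered $\gamma_m, \ldots, \gamma_1$.

```python
def compress(gammas, m):
    """Encode gammas = [gamma_m, ..., gamma_1] as bit lists V and U."""
    V, U = [], []
    prev = m  # s_m encodes the drop m - gamma_m, which is 0 or 1
    for g in gammas:
        drop = prev - g
        bits = [1] * drop if drop > 0 else [0]
        V.extend(bits)
        U.extend([0] * (len(bits) - 1) + [1])  # a 1 marks the end of s_k
        prev = g
    return V, U

def decompress(V, U, m):
    """Recover [gamma_m, ..., gamma_1] in a single scan."""
    gammas, ones = [], 0
    for v, u in zip(V, U):
        ones += v
        if u:  # end of a substring s_k: D_k = ones so far
            gammas.append(m - ones)
    return gammas
```

As a usage example, the sequence 9, 8, 8, 6, 6, 6, 3, 1, 0 with $m = 9$ is encoded with 13 bits in each of $V$ and $U$, within the bound $|V| \le 2m + 1$.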

We modify our algorithm from Section 4 to take advantage of Lemma 6. Let $(i_l, i_r)$ be an internal
arc in $A_Q$. Immediately before a recursive call to a light child $(i^k_l, i^k_r)$ of $(i_l, i_r)$ we compress
the at most 3 $\Gamma$ sequences maintained at $(i_l, i_r)$, namely $\Gamma(i^h_l, i^h_r)$, where $(i^h_l, i^h_r)$ is the heavy child,
$\Gamma(i^k_r + 1, i_r)$, and $\Gamma(i^k_r + 1, i_r - 1)$. Immediately after returning from the recursive call we decompress
the sequences again.

The total number of compressions and decompressions is $O(n)$. Hence, by Lemma 6 the
additional time used is $O(nm)$ and therefore the total running time of the algorithm remains $O(nm)$.
The space for storing the $O(\log |A_Q|)$ $\Gamma$ sequences becomes $O(m \log |A_Q|) = O(m \log n)$ bits, which is
$O(m)$ words since each word holds $\Theta(\log n)$ bits. Hence,
the total space is $O(n + m)$. In conclusion, we have shown Theorem 1.

5.1 Avoiding Decompression 

The above algorithm requires $O(n)$ decompressions. We briefly describe how one can avoid these
decompressions by augmenting the representation of $\Gamma$ sequences slightly. A rank/select index for
a bit string $B$ supports the operations RANK$(B, k)$ that returns the number of 1s in $B[1, k]$ and
SELECT$(B, k)$ that returns the position of the $k$th 1 in $B$. We can construct a rank/select index
in $O(|B|)$ time that uses $o(|B|)$ bits and supports both operations in constant time [20]. We add
a rank/select index to the bit strings $V$ and $U$ in our compressed representation. Since these use
$o(m)$ bits this does not affect the space complexity. Let $\gamma_m, \ldots, \gamma_1$ be a $\Gamma$ sequence compressed into
bit strings $V$ and $U$ augmented with a rank/select index. For any $k$, $1 \le k \le m$, we can compute
the element $\gamma_k$ in constant time as

\[
m - \text{RANK}(V, \text{SELECT}(U, m + 1 - k)).
\]

To see the correctness, first note that SELECT$(U, m + 1 - k)$ is the end position of the $(m + 1 - k)$th
substring in $V$. Therefore, RANK$(V, \text{SELECT}(U, m + 1 - k))$ is the sum of the bits in the first $m + 1 - k$
substrings of $V$. This is $D_k$ and since $\gamma_k = m - D_k$ the computation returns $\gamma_k$. In summary, we
have the following result.

Lemma 7 We can represent any $\Gamma$ sequence in $O(m)$ bits while allowing constant time access to
any element.
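
The access formula can be exercised with the sketch below. The `rank` and `select` helpers here are naive linear-time scans used only to illustrate the formula; they are not the constant-time succinct indices the lemma relies on. All names are hypothetical, and the encoder is repeated so that the example is self-contained.

```python
def compress(gammas, m):
    """Encode gammas = [gamma_m, ..., gamma_1] as bit lists V and U."""
    V, U, prev = [], [], m
    for g in gammas:
        bits = [1] * (prev - g) if prev > g else [0]
        V.extend(bits)
        U.extend([0] * (len(bits) - 1) + [1])
        prev = g
    return V, U

def rank(B, k):
    """Number of 1s in B[1, k] (1-based, inclusive); naive O(k) scan."""
    return sum(B[:k])

def select(B, k):
    """Position (1-based) of the k-th 1 in B; naive linear scan."""
    count = 0
    for pos, bit in enumerate(B, start=1):
        count += bit
        if bit and count == k:
            return pos
    raise ValueError("B contains fewer than k ones")

def access(V, U, m, k):
    """gamma_k = m - RANK(V, SELECT(U, m + 1 - k))."""
    return m - rank(V, select(U, m + 1 - k))
```

With the sequence 9, 8, 8, 6, 6, 6, 3, 1, 0 and $m = 9$, `access(V, U, 9, 6)` selects the end of the fourth substring of $V$ and returns $\gamma_6 = 6$ without decoding the rest of the sequence.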

The algorithm now only needs to compress $\Gamma$ sequences once. Whenever we need an element of a
compressed $\Gamma$ sequence we extract it in constant time as above. Hence, the asymptotic complexity
of the algorithm remains the same.